Abstract


Although the κ statistic has been widely used as an indicator of rater agreement, there have been some
concerns about the existence of different definitions and some peculiar results involving skewed data. This
note evaluates different definitions of κ and also demonstrates that the problem with directly comparing
κ values, especially for skewed data, can be avoided by comparing their significance.


1 Introduction 


The κ statistic seems to be the most commonly used measure of inter-rater agreement in Computational
Linguistics, especially within the discourse/dialog community, e.g., Carletta et al. [1997]. The κ statistic is
supposed to provide a means to compare inter-rater agreements of different experiments in a meaningful
way. To judge the ‘goodness’ of inter-rater agreement, several proposals have been made regarding
acceptable κ values. For example, Landis and Koch [1977, p. 165] consider κ > 0.8 “almost perfect” (as well
as other labels); Krippendorff [1980, p. 147] considers α > 0.8 (α being a closely related measure) acceptable
for “reporting on variables”; Emam [1999] (based on an empirical distribution in Software Engineering)
considers κ > 0.75 “excellent” (some other approaches are reviewed in Di Eugenio [2000, Sec. 2]). We must
note that these proposals often come with warnings such as “clearly arbitrary” Landis and Koch [1977, p. 165],
“should not be adopted ad hoc” Krippendorff [1980, p. 147], and a cautious description about the use of κ in
Carletta [1996, p. 252].

In spite of these warnings, the above-mentioned threshold values are widely used for judgment, e.g.,
Carletta [1996, p. 252] and Carletta et al. [1997, p. 25]. In this connection, several issues associated with
the use of κ have been raised, most recently by Di Eugenio [2000]. One aspect Di Eugenio [2000, Sec.
2.1.2] points out is about the nature of data, e.g., independence among categories. Two other more technical
issues are: (1) existence of different ways of computing chance agreement, an essential component in the κ
statistic, and (2) the behavior of κ on skewed data. In this note, we discuss the latter two issues in detail.

As for point (1), the existence of different ways of computing chance agreement has been pointed out in
Fleiss [1971, p. 379], Siegel and Castellan [1988, p. 290], and Di Eugenio [2000, Sec. 2.1.1]. However, the
effect of the difference has not been investigated in detail. As for point (2), the effects of skewed data have
been pointed out in Kraemer [1979, p. 470], Grove et al. [1981, p. 412], Chu-Carroll and Brown [1997, Sec. 2.2],
and Di Eugenio [2000, Sec. 2.1.1]. Feinstein and Cicchetti [1990] seems to be the most detailed account
of the situation.¹ The use of κ on skewed data is often considered problematic partly because the
above-mentioned thresholds do not appear to be applicable. One potential solution to this ‘problem’ is to
compute the significance of κ (instead of comparing raw κ values) (Di Eugenio et al. [1998, p. 327], and later
work). However, such a practice does not seem to be well established in the CL community.

¹ Chu-Carroll and Brown [1997, Sec. 2.2] propose to “lower” the chance agreement in their computation,
which seems to make the interpretation of their measures misleading.

In this note, I echo the warning of Di Eugenio [2000], and provide examples that illuminate the
issues involved in the computation of the κ statistic. In Section 2, we compare the κ statistics of Cohen [1960]
(limited to two raters) and Fleiss [1971] (applicable to multiple raters, also adopted in a widely cited text
by Siegel and Castellan [1988]). A possible extension of Cohen [1960] to multiple raters is also discussed.
In Section 3, we support the practice of Di Eugenio et al. [1998] and emphasize, through illustrative
examples, the necessity of computing significance as a means to overcome the difficulty of comparing κ values.


2 Chance Agreement 


This section compares the κ statistics of Cohen [1960] and Fleiss [1971], as well as the computation of
chance agreement in Krippendorff [1980]. To be consistent, we mainly adopt the notation used in Siegel 
and Castellan [1988]. Since Cohen is limited to two raters, this section first focuses on that case. After 
comparing Cohen and Fleiss, we will also explore an extension of Cohen to multiple raters, which could be 
used as an alternative to Fleiss. 

Let us first consider the following data for two raters X and Y, two categories A and B, and objects 1 
through 16 (N = 16): 


Objects | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16
Rater X | A | A | A | A | A | A | A | A | B | B  | B  | B  | B  | B  | B  | B      (1)
Rater Y | A | A | A | A | A | A | A | B | A | B  | B  | B  | B  | B  | B  | B


As a preparation, we consider the following table of joint probabilities to classify the judgments of the two 
raters. 


                  Rater Y
Category      A        B        Σ
Rater X  A    P(AA)  | P(AB)  | P(A_X)        (2)
         B    P(BA)  | P(BB)  | P(B_X)
         Σ    P(A_Y) | P(B_Y)


Then, (2) can be filled in as follows: 


                  Rater Y
Category      A      B      Σ
Rater X  A    0.44 | 0.06 | 0.50
         B    0.06 | 0.44 | 0.50
         Σ    0.50 | 0.50


The probability of actual agreement between X and Y, i.e., P(A), can be computed as

$$P(A) = P(AA) + P(BB) = 0.44 + 0.44 = 0.88$$


This formula/value is the same for Cohen [1960] and Fleiss [1971]. That is, the formula of Fleiss degenerates 
to that of Cohen for two raters. 

We then need to adjust the value of P(A) against the chance (or expected) agreement, P(E). According
to Cohen, P(E) is the probability of X and Y making the same decision by chance, shown as follows:

$$P(E_C) = P(A_X) P(A_Y) + P(B_X) P(B_Y) = 0.50 \times 0.50 + 0.50 \times 0.50 = 0.50$$

According to Fleiss, the expected agreement is as shown below, where $p_j$ is the probability of the category
$j$.

$$P(E_F) = \sum_j p_j^2$$
But we also note the following:

$$p_A = \frac{\#(A_X) + \#(A_Y)}{2N} = \frac{P(A_X) + P(A_Y)}{2}$$


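Then, we can compute $P(E_F)$ as follows, as the sum over categories of the squared average probability of
a rater choosing that category:

$$\begin{aligned}
P(E_F) &= \left(\frac{P(A_X) + P(A_Y)}{2}\right)^2 + \left(\frac{P(B_X) + P(B_Y)}{2}\right)^2 \\
       &= \left(\frac{0.50 + 0.50}{2}\right)^2 + \left(\frac{0.50 + 0.50}{2}\right)^2 = 0.50
\end{aligned} \qquad (3)$$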


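For the present data, $P(E_C) = P(E_F)$. The κ statistic can then be computed as a chance-adjusted value of
P(A), as shown below (0.88 is the rounded value of P(A) = 14/16 = 0.875, which gives κ = 0.75).

$$\kappa = \frac{P(A) - P(E)}{1 - P(E)} = \frac{0.88 - 0.50}{1 - 0.50} = 0.75$$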


Here is a summary of the values. 


         P(A) | P(E) | κ
Cohen  | 0.88 | 0.50 | 0.75
Fleiss | 0.88 | 0.50 | 0.75
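
The two-rater computation above can be sketched in Python as follows. This is only an illustration, not the spreadsheet used for this note; the function name `kappa_two_raters` and the encoding of judgments as strings of category letters are assumptions made here. The sketch works with unrounded probabilities, so P(A) is 0.875 rather than 0.88.

```python
from collections import Counter

def kappa_two_raters(x, y):
    """Return P(A), P(E_C), kappa_C, P(E_F), kappa_F for two raters.

    x, y: equal-length sequences of category labels (hypothetical encoding).
    """
    n = len(x)
    categories = sorted(set(x) | set(y))
    # Observed agreement: fraction of objects given the same label.
    p_a = sum(a == b for a, b in zip(x, y)) / n
    # Marginal probability of each category for each rater.
    cx, cy = Counter(x), Counter(y)
    px = {c: cx[c] / n for c in categories}
    py = {c: cy[c] / n for c in categories}
    # Cohen: product of the two raters' marginals, summed over categories.
    pe_cohen = sum(px[c] * py[c] for c in categories)
    # Fleiss: square of the rater-averaged marginal, summed over categories.
    pe_fleiss = sum(((px[c] + py[c]) / 2) ** 2 for c in categories)

    def kappa(pe):
        # Chance-adjusted agreement; undefined when pe == 1.
        return (p_a - pe) / (1 - pe)

    return p_a, pe_cohen, kappa(pe_cohen), pe_fleiss, kappa(pe_fleiss)

# Data (1): X and Y differ only on objects 8 and 9.
X1 = "AAAAAAAABBBBBBBB"
Y1 = "AAAAAAABABBBBBBB"
p_a, pe_c, k_c, pe_f, k_f = kappa_two_raters(X1, Y1)
print(p_a, pe_c, k_c, pe_f, k_f)
```

Because both raters have the same marginals in data (1), the two definitions of chance agreement coincide here.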


In general, however, Cohen and Fleiss end up with different κ values. Let us next consider the following
data:


Objects | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16
Rater X | A | A | A | A | A | A | A | A | B | B  | B  | B  | B  | B  | B  | B      (4)
Rater Y | A | A | A | A | A | A | A | A | A | A  | A  | A  | A  | A  | A  | B


Following the same procedure, we can fill in the table (2) and compute P(A), P(E), and κ as follows:


                  Rater Y
Category      A      B      Σ
Rater X  A    0.50 | 0.00 | 0.50
         B    0.44 | 0.06 | 0.50
         Σ    0.94 | 0.06


         P(A) | P(E) | κ
Cohen  | 0.56 | 0.50 | 0.13
Fleiss | 0.56 | 0.60 | −0.08


In this case, there is a substantial difference in the κ value. In fact, according to Cohen, the actual agreement
is better than chance, while according to Fleiss, it is worse than chance (strong disagreement). The
discrepancy is due to the skewed distribution of categories between the two raters. Cohen computes chance
based on each rater's own distribution of judgments; Fleiss computes chance from the probability of each
category averaged over the raters.

More generally, it is possible to compute the difference between the chance agreements of the two 
definitions: 


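$$\begin{aligned}
P(E_F) - P(E_C)
 &= \left(\frac{P(A_X) + P(A_Y)}{2}\right)^2 + \left(\frac{P(B_X) + P(B_Y)}{2}\right)^2
    - \left[P(A_X) P(A_Y) + P(B_X) P(B_Y)\right] \\
 &= \left(\frac{P(A_X) - P(A_Y)}{2}\right)^2 + \left(\frac{P(B_X) - P(B_Y)}{2}\right)^2 \\
 &= \left(\frac{0.5 - 0.94}{2}\right)^2 + \left(\frac{0.5 - 0.06}{2}\right)^2 = 0.10
\end{aligned}$$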
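
The difference between the two chance agreements can be checked numerically. The sketch below uses illustrative function names that do not come from the source. It evaluates both definitions from the marginals of data (4) and compares their difference with the sum of squared half-differences of the marginals.

```python
def pe_cohen(px, py):
    """Cohen's chance agreement from per-rater marginals (dicts by category)."""
    return sum(px[c] * py[c] for c in px)

def pe_fleiss(px, py):
    """Fleiss's chance agreement: squared rater-averaged marginals."""
    return sum(((px[c] + py[c]) / 2) ** 2 for c in px)

def closed_form_difference(px, py):
    """Sum over categories of ((P(c_X) - P(c_Y)) / 2) ** 2."""
    return sum(((px[c] - py[c]) / 2) ** 2 for c in px)

# Marginals of data (4): X is balanced, Y chooses A for 15 of 16 objects.
px = {"A": 8 / 16, "B": 8 / 16}
py = {"A": 15 / 16, "B": 1 / 16}

diff = pe_fleiss(px, py) - pe_cohen(px, py)
print(round(diff, 4), round(closed_form_difference(px, py), 4))
```

Since the difference is a sum of squares, it vanishes exactly when every category has the same marginal probability for both raters, and is positive otherwise.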


That is, Cohen and Fleiss agree if and only if $P(A_X) = P(A_Y)$ and $P(B_X) = P(B_Y)$, as seen in (1). The
additional assumption made by Fleiss is that the distribution of categories is even for each rater. Note that
Siegel and Castellan [1988, p. 291] state that the additional assumption made by Fleiss [1971] is that
$P(R_X)$, for some rater R and the category X, is “the same for all raters”. But this statement is too weak
because even without different $P(R_X)$ for two raters, Cohen and Fleiss agree as long as $P(A_X) = P(A_Y)$ and
$P(B_X) = P(B_Y)$. We can verify this by modifying (1) as follows: replacing B with A for, e.g., objects 14
through 16 for both X and Y.

Before proceeding, let us examine yet another way of computing chance agreement discussed in
Krippendorff [1980, p. 134]. In this case, chance agreement is based on a combinatory selection process.

$$P(E_K) = \frac{\#(A_X) + \#(A_Y)}{2N} \cdot \frac{\#(A_X) + \#(A_Y) - 1}{2N - 1}
         + \frac{\#(B_X) + \#(B_Y)}{2N} \cdot \frac{\#(B_X) + \#(B_Y) - 1}{2N - 1}$$

Since this approach does not distinguish the source of the available categories, when Cohen and Fleiss
disagree significantly, it gives a value closer to, but still different from, Fleiss.
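
A small numerical sketch may help compare this value with the other two. It assumes that the formula pools each category's counts over both raters and draws two judgments without replacement from the 2N pooled judgments; the function name `pe_krippendorff` is introduced here for illustration only.

```python
from collections import Counter

def pe_krippendorff(x, y):
    """Chance agreement from drawing two judgments without replacement
    from the pooled 2N judgments of both raters (assumed reading)."""
    pooled = Counter(x) + Counter(y)   # category counts over both raters
    total = len(x) + len(y)            # 2N
    return sum(n * (n - 1) for n in pooled.values()) / (total * (total - 1))

# Data (4): X is balanced, Y chooses A for 15 of 16 objects.
X4 = "AAAAAAAABBBBBBBB"
Y4 = "AAAAAAAAAAAAAAAB"
pe_k = pe_krippendorff(X4, Y4)
print(round(pe_k, 3))
```

For data (4), the result lies between Cohen's 0.50 and Fleiss's 0.60, and nearer the latter.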
Observing three different ways of computing chance agreement, it is clear that we must always identify
which κ statistic is being used. Comparing Cohen [1960] and Fleiss [1971], the stronger assumption
of Fleiss does not seem appropriate because the formula (3) inherently includes the irrelevant values
$P(A_X)^2$, $P(A_Y)^2$, $P(B_X)^2$, and $P(B_Y)^2$, as can be seen below.


$$P(E_F) = \frac{P(A_X)^2 + 2 P(A_X) P(A_Y) + P(A_Y)^2}{4}
         + \frac{P(B_X)^2 + 2 P(B_X) P(B_Y) + P(B_Y)^2}{4}$$


That is, $P(E_F)$ is diluted with these terms that correspond to raters' agreement with themselves.
Krippendorff [1980, p. 134] too, being closer to Fleiss, has such components in the computation. As a result,
for the two-rater case, Cohen [1960] seems to be a more appropriate way of computing κ.

Unfortunately, Cohen [1960] is limited to two raters, and presumably, that is the reason Fleiss [1971] 
presents an extension. However, Fleiss does not agree with Cohen for the two-rater case due to the stronger 
assumption on chance agreement; again, the problem is inclusion of ‘self agreement’. 

Let us now examine the chance agreement of Fleiss for k raters with m categories ($p_j$ for the probability
of the jth category; $p_{j,r}$ for the probability of the jth category and the rth rater):

$$P(E_F) = \sum_{j=1}^{m} p_j^2 = \sum_{j=1}^{m} \left(\frac{1}{k} \sum_{r=1}^{k} p_{j,r}\right)^2$$


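As in the two-rater case, the computation includes ‘self agreement’, which appears as the diagonal elements
(in parentheses) in the following table (for a fixed category, the indices 1 through k range over raters).


            p_1           p_2           p_3           ...   p_{k-1}             p_k
p_1       (p_1 p_1)   | p_1 p_2     | p_1 p_3     | ... | p_1 p_{k-1}       | p_1 p_k
p_2       p_2 p_1     | (p_2 p_2)   | p_2 p_3     | ... | p_2 p_{k-1}       | p_2 p_k
p_3       p_3 p_1     | p_3 p_2     | (p_3 p_3)   | ... | p_3 p_{k-1}       | p_3 p_k          (5)
...
p_{k-1}   p_{k-1} p_1 | p_{k-1} p_2 | p_{k-1} p_3 | ... | (p_{k-1} p_{k-1}) | p_{k-1} p_k
p_k       p_k p_1     | p_k p_2     | p_k p_3     | ... | p_k p_{k-1}       | (p_k p_k)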


An extension of Cohen [1960] that degenerates to Cohen for the two-rater case must not include the diagonal
elements. The following, called $P(E_{C'})$, satisfies this condition by averaging the lower triangular elements
(less diagonal elements):

$$P(E_{C'}) = \sum_{j=1}^{m} \frac{\sum_{r=2}^{k} \sum_{s=1}^{r-1} p_{j,r}\, p_{j,s}}{k(k-1)/2} \qquad (6)$$

Note that we can still use the general form of P(A) in Fleiss [1971].

In order to see the effect of the above formula, let us now apply it to Table 9.15 of Siegel and Castellan
[1988, p. 291]. According to Fleiss, the result is κ = 0.41 regardless of which rater made each judgment.
As for the extended Cohen (6), the κ value would vary at least between 0.41 and 0.46. Since κ = 0.41 is
already significant (p < 0.01), this particular case would not affect our judgment. However, this suggests a
possibility of a large difference in κ and thus a possibility of affecting the evaluation.

In this section, we see that the chance agreement of Fleiss [1971] is less desirable than that of Cohen
[1960]. We also see that it is possible to generalize Cohen to multiple raters without the stronger assumption
of Fleiss.
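
The extended chance agreement of this section can be sketched as below. The function names and the representation of each rater's marginals as a dictionary are assumptions made for illustration, and the third rater Z is hypothetical. The sketch averages the products of marginals over all unordered pairs of distinct raters, which is one reading of "lower triangular elements less the diagonal".

```python
from itertools import combinations

def pe_cohen_extended(marginals):
    """Pairwise-averaged chance agreement for k raters.

    marginals: list of k dicts mapping category -> that rater's probability
    (all dicts are assumed to share the same keys).
    """
    categories = list(marginals[0])
    # The k(k-1)/2 unordered pairs of distinct raters (no self agreement).
    pairs = list(combinations(range(len(marginals)), 2))
    total = sum(
        marginals[r][c] * marginals[s][c]
        for c in categories
        for r, s in pairs
    )
    return total / len(pairs)

def pe_fleiss(marginals):
    """Fleiss's chance agreement: squared rater-averaged marginals."""
    k = len(marginals)
    return sum((sum(m[c] for m in marginals) / k) ** 2 for c in marginals[0])

# Marginals of data (4) for raters X and Y.
px = {"A": 8 / 16, "B": 8 / 16}
py = {"A": 15 / 16, "B": 1 / 16}
# Hypothetical third rater Z, added only to exercise the k = 3 case.
pz = {"A": 8 / 16, "B": 8 / 16}

print(pe_cohen_extended([px, py]), pe_fleiss([px, py]))
print(pe_cohen_extended([px, py, pz]), pe_fleiss([px, py, pz]))
```

With two raters the extended value equals Cohen's chance agreement, while Fleiss's value is larger because it also contains the self-agreement terms.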


3 Significance 


Although the puzzling κ values for skewed data have been noted and discussed in the literature, few papers
actually demonstrate the relation between κ values and their significance. This section examines the relation
between κ values and their significance with a few examples.

First, we compute κ for the following data, where κ = 0.50:


Objects | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16
Rater X | A | A | A | A | A | A | A | A | B | B  | B  | B  | B  | B  | B  | B      (7)
Rater Y | A | A | A | A | A | A | B | B | A | A  | B  | B  | B  | B  | B  | B


If the number of objects is large, then by the Central Limit Theorem we can assume that the distribution
of κ is approximately normal. Then, we can adopt the procedure found in Fleiss [1971] (repeated in Siegel and
Castellan [1988]), and compute the significance of κ values. Here, we use a special case for two raters,
which agrees with the significance computation of Cohen [1960].


$$\mathrm{var}(\kappa) = \frac{P(E)}{N\,(1 - P(E))} \quad (\text{for } k = 2)$$

$$z = \frac{\kappa}{\sqrt{\mathrm{var}(\kappa)}}$$


The results are summarized below. 


         P(A) | P(E) | κ    | var(κ) | z
Cohen  | 0.75 | 0.50 | 0.50 | 0.063  | 2.00
Fleiss | 0.75 | 0.50 | 0.50 | 0.063  | 2.00


We can conclude that the agreement is significant (p < .05). 
Next, we consider the following data. 


Objects | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16
Rater X | A | A | A | A | A | A | A | A | A | A  | A  | A  | A  | A  | A  | B      (8)
Rater Y | A | A | A | A | A | A | A | A | A | A  | A  | A  | A  | A  | A  | B


The results for this data are summarized as follows. 


         P(A) | P(E) | κ    | var(κ) | z
Cohen  | 1.00 | 0.88 | 1.00 | 0.47   | 1.46
Fleiss | 1.00 | 0.88 | 1.00 | 0.47   | 1.46


We conclude that the agreement is not significant at the .05 level (p > .05). As we can see, a high κ value
(even perfect agreement) does not guarantee significant agreement because P(E) is already very high. That is,
it is easy to get agreement between the raters by chance.
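
The two examples can be reproduced with the short sketch below. The function names are illustrative, and the use of a one-sided normal tail probability is an assumption made here, since the source states only the significance level.

```python
import math

def z_two_raters(p_a, p_e, n):
    """kappa, var(kappa) and z for two raters under the normal approximation."""
    kappa = (p_a - p_e) / (1 - p_e)
    var = p_e / (n * (1 - p_e))      # special case of the variance for k = 2
    return kappa, var, kappa / math.sqrt(var)

def one_sided_p(z):
    """Upper-tail probability of a standard normal variate (assumed test)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Data (7): P(A) = 12/16 and P(E) = 0.5.
k7, v7, z7 = z_two_raters(12 / 16, 0.5, 16)
# Data (8): perfect agreement, but each rater chooses A for 15 of 16 objects.
pe8 = (15 / 16) ** 2 + (1 / 16) ** 2
k8, v8, z8 = z_two_raters(1.0, pe8, 16)

print(k7, round(v7, 3), round(z7, 2), round(one_sided_p(z7), 3))
print(k8, round(v8, 2), round(z8, 2), round(one_sided_p(z8), 3))
```

The data set with the lower κ yields the larger z, because the variance grows quickly as P(E) approaches 1.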

This situation involving skewed data has been referred to as a “problem” Grove et al. [1981, p. 412], a
“paradox” Feinstein and Cicchetti [1990, p. 543], “highly problematic” Chu-Carroll and Brown [1997, Sec.
2.2], or having “difficulty in the interpretation of κ” [Berry, 1992]. In addition, Di Eugenio [2000, Sec.
2.1.1] asks “why” and Kraemer [1979, p. 461] writes “its [κ statistic] value as a measure of the quality of
an observation in clinical or research contexts is not clear” partly because of this property.

As has been explored, e.g., in Feinstein and Cicchetti [1990], the cause of this property of κ is the high
chance agreement due to skewed data. There are some alternatives to κ, e.g., Kraemer [1979]. However,
as summarized by Goldman [1992], Shrout et al. [1987, p. 176] and Cicchetti and Feinstein [1990, p. 557]
conclude that this property of κ is actually a desirable consequence of the chance agreement adjustment.
Thus, when data is skewed, the judges must do better than chance to result in significant agreement; in some
cases, the researcher may need to increase the number of objects. As shown in the above example, it is in
general meaningless to directly compare κ values. Instead, by computing z values, we can still compare the
significance of different data.

In Section 2, we introduced an extension of Cohen [1960] that is more desirable than Fleiss [1971]. To
be able to compare the significance of κ for multiple raters, we will need a way to compute significance
for the extension. Here is a preliminary analysis adapted from Fleiss. Since the numerator of the formula
for computing variance in [10] of Fleiss [1971, p. 380] does not depend on P(E), we can leave that part
as is (except that we use $\sum p_j^2$ instead of P(E) because we replace P(E) with $P(E_{C'})$). Since P(E) in the
denominator refers to the chance agreement, we can use $P(E_{C'})$ in its place. Then, we obtain a formula
similar to that of Fleiss, shown in a way similar to (9.30) in Siegel and Castellan [1988].


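$$\mathrm{var}(\kappa) = \frac{2}{N k (k-1)} \cdot
  \frac{\sum p_j^2 - (2k-3) \left[\sum p_j^2\right]^2 + 2(k-2) \sum p_j^3}{\left(1 - P(E_{C'})\right)^2}$$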
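
As a rough check of this preliminary formula, the sketch below evaluates it and confirms that, for two raters with identical marginals, it gives the same value as the two-rater variance used earlier in this section. The function name and argument layout are assumptions made for illustration, and the $p_j$ are taken to be the category proportions pooled over raters.

```python
def var_kappa_extended(p, k, n, pe_ext):
    """Variance sketch: Fleiss-style numerator with P(E_C') in the denominator.

    p: pooled category proportions p_j; k: number of raters;
    n: number of objects; pe_ext: chance agreement of the extended Cohen.
    """
    s2 = sum(q ** 2 for q in p)   # sum of p_j squared
    s3 = sum(q ** 3 for q in p)   # sum of p_j cubed
    numerator = s2 - (2 * k - 3) * s2 ** 2 + 2 * (k - 2) * s3
    return 2 / (n * k * (k - 1)) * numerator / (1 - pe_ext) ** 2

# Data (7): two raters, balanced marginals, N = 16, chance agreement 0.5.
v = var_kappa_extended([0.5, 0.5], k=2, n=16, pe_ext=0.5)
print(v)
```

When the raters' marginals differ, the sum of squared $p_j$ and $P(E_{C'})$ no longer coincide, so the result departs from the two-rater special case.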


4 Conclusion 


The first point of this note is that for two raters, the κ statistic of Cohen [1960] is more desirable than
that of Fleiss [1971] because the latter requires an unnecessarily strong assumption. Although the original
computation of Cohen is limited to two raters, we noted that it can be extended to multiple-rater cases
without using the stronger assumption of Fleiss.

The second point is that the so-called ‘problem’ with skewed data can be avoided by comparing the
significance of κ values. This is also possible for the extended Cohen for multiple raters, introduced in
Section 2.

In spite of the popularity of the κ statistic, a number of issues are being reported in the literature. In
some cases, researchers do not pay sufficient attention to the ‘meaning’ of different κ statistics. In some
other cases, researchers use arbitrary thresholds for evaluation with little justification. I hope that this note
presents materials that are useful for delineating some of these issues.

Finally, the MS Excel file used for computing various values in this note is available at:
“http://www.tcnj.edu/~komagata/pub/kappa.xls”.